INTERSPEECH.2020 - Analysis and Assessment

Total: 46

#1 RECOApy: Data Recording, Pre-Processing and Phonetic Transcription for End-to-End Speech-Based Applications

Author: Adriana Stan

Deep learning enables the development of efficient end-to-end speech processing applications while bypassing the need for expert linguistic and signal processing features. Yet, recent studies show that good quality speech resources and phonetic transcription of the training data can enhance the results of these applications. In this paper, the RECOApy tool is introduced. RECOApy streamlines the steps of data recording and pre-processing required in end-to-end speech-based applications. The tool implements an easy-to-use interface for prompted speech recording, spectrogram and waveform analysis, utterance-level normalisation and silence trimming, as well as grapheme-to-phoneme conversion of the prompts in eight languages: Czech, English, French, German, Italian, Polish, Romanian and Spanish. The grapheme-to-phoneme (G2P) converters are deep neural network (DNN) based architectures trained on lexicons extracted from the Wiktionary online collaborative resource. Given the differing degrees of orthographic transparency and the varying number of phonetic entries across the languages, the DNN hyperparameters are optimised with an evolution strategy. The phoneme and word error rates of the resulting G2P converters are presented and discussed. The tool, the processed phonetic lexicons and trained G2P models are made freely available.

#2 Analyzing the Quality and Stability of a Streaming End-to-End On-Device Speech Recognizer

Authors: Yuan Shangguan ; Kate Knister ; Yanzhang He ; Ian McGraw ; Françoise Beaufays

The demand for fast and accurate incremental speech recognition increases as the applications of automatic speech recognition (ASR) proliferate. Incremental speech recognizers output chunks of partially recognized words while the user is still talking. Partial results can be revised before the ASR finalizes its hypothesis, causing instability issues. We analyze the quality and stability of on-device streaming end-to-end (E2E) ASR models. We first introduce a novel set of metrics that quantify the instability at word and segment levels. We study the impact of several model training techniques that improve E2E model qualities but degrade model stability. We categorize the causes of instability and explore various solutions to mitigate them in a streaming E2E ASR system.
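
As a concrete illustration of what a word-level instability measure can look like, here is a minimal Python sketch of a generic revision rate over streaming partial hypotheses. It is a hypothetical stand-in for illustration only, not a reimplementation of the metrics introduced in the paper.

```python
def word_revision_rate(partials, final):
    """A simple word-level instability measure for streaming ASR (an illustrative
    stand-in, not the paper's metrics): the fraction of words emitted in partial
    hypotheses that do not survive, at the same position, into the final hypothesis."""
    emitted = 0
    revised = 0
    final_words = final.split()
    for partial in partials:
        for i, word in enumerate(partial.split()):
            emitted += 1
            if i >= len(final_words) or final_words[i] != word:
                revised += 1
    return revised / emitted if emitted else 0.0

# Hypothetical stream of partial results followed by the final hypothesis.
partials = ["i", "i want", "i want to", "eye want to go"]
print(word_revision_rate(partials, final="i want to go"))
```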

#3 Statistical Testing on ASR Performance via Blockwise Bootstrap

Authors: Zhe Liu ; Fuchun Peng

A common question raised in automatic speech recognition (ASR) evaluations is how reliable an observed word error rate (WER) improvement between two ASR systems is; statistical hypothesis testing and confidence intervals (CI) can be used to tell whether this improvement is real or only due to random chance. The bootstrap resampling method, which is intuitive and easy to use, has been popular for such significance analysis. However, this method fails in dealing with dependent data, which is prevalent in the speech world; for example, ASR performance on utterances from the same speaker can be correlated. In this paper we present a blockwise bootstrap approach: by dividing evaluation utterances into non-overlapping blocks, this method resamples these blocks instead of the original data. We show that the resulting variance estimator of the absolute WER difference between two ASR systems is consistent under mild conditions. We also demonstrate the validity of the blockwise bootstrap method on both synthetic and real-world speech data.
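
The resampling step behind this approach is straightforward to prototype. The following Python sketch (an illustration under assumed inputs, not the authors' code) resamples speaker-level blocks of utterances with replacement and derives a bootstrap confidence interval for the absolute WER difference between two systems; the per-utterance error and word counts are hypothetical.

```python
import random

def blockwise_bootstrap_wer_diff(blocks, n_boot=10000, alpha=0.05, seed=0):
    """Blockwise bootstrap CI for the absolute WER difference of two ASR systems.

    `blocks` is a list of blocks (e.g. one block per speaker); each block is a
    list of utterances, and each utterance is a tuple
    (errors_system_A, errors_system_B, n_reference_words).
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        # Resample whole blocks with replacement, keeping within-block dependence intact.
        resampled = [blocks[rng.randrange(len(blocks))] for _ in range(len(blocks))]
        utts = [u for block in resampled for u in block]
        words = sum(n for _, _, n in utts)
        wer_a = sum(ea for ea, _, _ in utts) / words
        wer_b = sum(eb for _, eb, _ in utts) / words
        diffs.append(wer_a - wer_b)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_boot)]
    hi = diffs[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n_boot, (lo, hi)

# Hypothetical usage: two speakers, each with a few utterances.
blocks = [
    [(3, 2, 20), (1, 1, 15)],             # speaker 1
    [(5, 6, 30), (2, 1, 10), (0, 0, 8)],  # speaker 2
]
mean_diff, ci = blockwise_bootstrap_wer_diff(blocks, n_boot=2000)
print(f"WER(A) - WER(B): {mean_diff:.4f}, 95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
```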

#4 Sentence Level Estimation of Psycholinguistic Norms Using Joint Multidimensional Annotations

Authors: Anil Ramakrishna ; Shrikanth Narayanan

Psycholinguistic normatives represent various affective and mental constructs using numeric scores and are used in a variety of applications in natural language processing. They are commonly used at the sentence level, the scores of which are estimated by extrapolating word level scores using simple aggregation strategies, which may not always be optimal. In this work, we present a novel approach to estimate the psycholinguistic norms at sentence level. We apply a multidimensional annotation fusion model on annotations at the word level to estimate a parameter which captures relationships between different norms. We then use this parameter at sentence level to estimate the norms. We evaluate our approach by predicting sentence level scores for various normative dimensions and compare with standard word aggregation schemes.

#5 Neural Zero-Inflated Quality Estimation Model for Automatic Speech Recognition System

Authors: Kai Fan ; Bo Li ; Jiayi Wang ; Shiliang Zhang ; Boxing Chen ; Niyu Ge ; Zhijie Yan

The performance of automatic speech recognition (ASR) systems is usually evaluated with the word error rate (WER) metric, which requires manually transcribed data that are expensive to obtain in real scenarios. In addition, the empirical distribution of WER for most ASR systems usually puts a significant mass near zero, making it difficult to model with a single continuous distribution. To address these two issues of ASR quality estimation (QE), we propose a novel neural zero-inflated model that predicts the WER of an ASR result without transcripts. We design a neural zero-inflated beta regression on top of a bidirectional transformer language model conditioned on speech features (speech-BERT). We adopt the pre-training strategy of token-level masked language modeling for speech-BERT as well, and further fine-tune with our zero-inflated layer for the mixture of discrete and continuous outputs. The experimental results show that our approach achieves better performance on WER prediction compared with strong baselines.
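
The zero-inflated output distribution itself is compact: with some probability the WER is exactly zero, and otherwise it follows a Beta distribution. Below is a minimal NumPy sketch of the corresponding negative log-likelihood; it illustrates the general zero-inflated Beta idea only and is not the speech-BERT model, whose predictor outputs would supply pi, alpha and beta.

```python
import numpy as np
from scipy.special import betaln

def zero_inflated_beta_nll(wer, pi, alpha, beta, eps=1e-8):
    """Negative log-likelihood of observed WERs under a zero-inflated Beta model.

    wer         : array of observed WERs in [0, 1)
    pi          : predicted probability of WER == 0 (the zero-inflation part)
    alpha, beta : predicted Beta shape parameters for the non-zero part
    """
    wer = np.asarray(wer, dtype=float)
    is_zero = (wer == 0.0)
    # log p(0) = log(pi)
    ll_zero = np.log(pi + eps)
    # log p(w) = log(1 - pi) + log Beta(w; alpha, beta) for 0 < w < 1
    w = np.clip(wer, eps, 1.0 - eps)
    ll_beta = (np.log(1.0 - pi + eps)
               + (alpha - 1.0) * np.log(w)
               + (beta - 1.0) * np.log(1.0 - w)
               - betaln(alpha, beta))
    return -np.where(is_zero, ll_zero, ll_beta).mean()

# Hypothetical batch: several utterances recognized perfectly (WER = 0).
print(zero_inflated_beta_nll([0.0, 0.0, 0.12, 0.31], pi=0.4, alpha=2.0, beta=6.0))
```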

#6 Confidence Measures in Encoder-Decoder Models for Speech Recognition

Authors: Alejandro Woodward ; Clara Bonnín ; Issey Masuda ; David Varas ; Elisenda Bou-Balust ; Juan Carlos Riveiro

Recent improvements in Automatic Speech Recognition (ASR) systems have enabled the growth of myriad applications such as voice assistants, intent detection, keyword extraction and sentiment analysis. These applications, which are now widely used in industry, are very sensitive to the errors generated by ASR systems. This could be overcome by having a reliable confidence measure associated with the predicted output. This work presents a novel method which uses internal neural features of a frozen ASR model to train an independent neural network to predict a softmax temperature value. This value is computed at each decoder time step and multiplied by the logits in order to redistribute the output probabilities. The resulting softmax values corresponding to the predicted tokens constitute a more reliable confidence measure. Moreover, this work also studies the effect of teacher forcing on the training of the proposed temperature prediction module. The output confidence estimation shows an improvement of -25.78% in EER and +7.59% in AUC-ROC with respect to the unaltered softmax values of the predicted tokens, evaluated on a proprietary dataset consisting of News and Entertainment videos.
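
The confidence computation described above reduces to rescaling the decoder logits with the predicted temperature before the softmax and reading off the probability of the chosen token. The sketch below illustrates that step in isolation; the temperature predictor network itself is assumed and not shown.

```python
import numpy as np

def temperature_confidence(logits, temperature):
    """Confidence of the predicted token after rescaling the logits.

    Following the idea in the abstract, a per-step temperature multiplies the
    logits before the softmax, and the probability of the argmax token is used
    as the confidence score. The temperature would come from a separate
    predictor network (not shown here).
    """
    z = np.asarray(logits, dtype=float) * temperature
    z -= z.max()                      # numerical stability
    probs = np.exp(z) / np.exp(z).sum()
    pred = int(np.argmax(logits))     # the prediction is unchanged by a positive temperature
    return pred, float(probs[pred])

# Hypothetical decoder step: a temperature below 1 flattens the distribution,
# lowering the confidence of an otherwise over-confident prediction.
print(temperature_confidence([4.0, 2.5, 0.3, -1.0], temperature=1.0))
print(temperature_confidence([4.0, 2.5, 0.3, -1.0], temperature=0.5))
```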

#7 Word Error Rate Estimation Without ASR Output: e-WER2

Authors: Ahmed Ali ; Steve Renals

Measuring the performance of automatic speech recognition (ASR) systems requires manually transcribed data in order to compute the word error rate (WER), which is often time-consuming and expensive. In this paper, we continue our effort in estimating WER using acoustic, lexical and phonotactic features. Our novel approach to estimating the WER uses a multistream end-to-end architecture. We report results for systems using internal speech decoder features (glass-box), systems without speech decoder features (black-box), and systems without access to the ASR system (no-box). The no-box system learns a joint acoustic-lexical representation from phoneme recognition results along with MFCC acoustic features to estimate WER. Considering WER per sentence, our no-box system achieves 0.56 Pearson correlation with the reference evaluation and 0.24 root mean square error (RMSE) across 1,400 sentences. The overall WER estimated by e-WER2 is 30.9% for a three-hour test set, while the WER computed using the reference transcriptions was 28.5%.
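
The sentence-level evaluation quoted above (Pearson correlation and RMSE between estimated and reference WER) can be reproduced with a few lines of NumPy. The sketch below shows only the metric computation on hypothetical values, not the e-WER2 estimator itself.

```python
import numpy as np

def evaluate_wer_estimator(wer_estimated, wer_reference):
    """Pearson correlation and RMSE between estimated and reference per-sentence WERs."""
    est = np.asarray(wer_estimated, dtype=float)
    ref = np.asarray(wer_reference, dtype=float)
    pearson = np.corrcoef(est, ref)[0, 1]
    rmse = np.sqrt(np.mean((est - ref) ** 2))
    return pearson, rmse

# Hypothetical estimates for four sentences.
print(evaluate_wer_estimator([0.10, 0.35, 0.00, 0.52], [0.12, 0.30, 0.05, 0.60]))
```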

#8 An Evaluation of Manual and Semi-Automatic Laughter Annotation

Authors: Bogdan Ludusan ; Petra Wagner

With laughter research seeing increased development in recent years, there is also a growing need for materials with laughter annotations. We examine in this study how one can leverage existing spontaneous speech resources to this end. We first analyze the process of manual laughter annotation in corpora by establishing two important parameters of the process: the amount of time required and its inter-rater reliability. Next, we propose a novel semi-automatic tool for laughter annotation, based on a signal-based representation of speech rhythm. We test both annotation approaches on the same recordings, containing German dyadic spontaneous interactions, and employ a larger pool of annotators than previously done. We then compare and discuss the obtained results based on the two aforementioned parameters, highlighting the benefits and costs associated with each approach.

#9 Understanding Racial Disparities in Automatic Speech Recognition: The Case of Habitual “be”

Authors: Joshua L. Martin ; Kevin Tang

Recent research has highlighted that state-of-the-art automatic speech recognition (ASR) systems exhibit a bias against African American speakers. In this research, we investigate the underlying causes of this racially based disparity in performance, focusing on a unique morpho-syntactic feature of African American English (AAE), namely habitual “be”, an invariant form of “be” that encodes the habitual aspect. Looking at over 100 hours of spoken AAE, we evaluated two ASR systems, DeepSpeech and Google Cloud Speech, to examine how well habitual “be” and its surrounding contexts are inferred. While controlling for local language and acoustic factors such as the amount of context, noise, and speech rate, we found that habitual “be” and its surrounding words were more error-prone than non-habitual “be” and its surrounding words. These findings hold both when the utterance containing “be” is processed in isolation and when it is processed in conjunction with surrounding utterances within the speaker's turn. Our research highlights the need for equitable ASR systems to take into account dialectal differences beyond acoustic modeling.

#10 Improving X-Vector and PLDA for Text-Dependent Speaker Verification

Authors: Zhuxin Chen ; Yue Lin

Recently, the pipeline consisting of an x-vector speaker embedding front-end and a Probabilistic Linear Discriminant Analysis (PLDA) back-end has achieved state-of-the-art results in text-independent speaker verification. In this paper, we further improve the performance of x-vector and PLDA based systems for text-dependent speaker verification by exploring the choice of layer used to produce the embedding and by modifying the back-end training strategies. In particular, we find that x-vector based embeddings, specifically the standard deviation statistics in the pooling layer, contain information related to both speaker characteristics and spoken content. Accordingly, we modify the back-end training labels by utilizing both the speaker-id and the phrase-id. A correlation-alignment-based PLDA adaptation is also adopted to make use of the text-independent labeled data during back-end training. Experimental results on the SDSVC 2020 dataset show that our proposed methods achieve significant performance improvements compared with the x-vector and HMM based i-vector baselines.

#11 SdSV Challenge 2020: Large-Scale Evaluation of Short-Duration Speaker Verification

Authors: Hossein Zeinali ; Kong Aik Lee ; Jahangir Alam ; Lukáš Burget

Modern approaches to speaker verification represent speech utterances as fixed-length embeddings. With these approaches, we implicitly assume that speaker characteristics are independent of the spoken content. Such an assumption generally holds when sufficiently long utterances are given, and in this context speaker embeddings like i-vectors and x-vectors have been shown to be extremely effective. For speech utterances of short duration (on the order of a few seconds), speaker embeddings have shown a significant dependency on the phonetic content. In this regard, the SdSV Challenge 2020 was organized with a broad focus on systematic benchmarking and analysis of varying degrees of phonetic variability in short-duration speaker verification (SdSV). In addition to text-dependent and text-independent tasks, the challenge features an unusual and difficult task of cross-lingual speaker verification (English vs. Persian). This paper describes the dataset and tasks, the evaluation rules and protocols, the performance metric, baseline systems, and challenge results. We also present insights gained from the evaluation and future research directions.

#12 The XMUSPEECH System for Short-Duration Speaker Verification Challenge 2020

Authors: Tao Jiang ; Miao Zhao ; Lin Li ; Qingyang Hong

In this paper, we present our XMUSPEECH system for Task 1 of the Short-duration Speaker Verification (SdSV) Challenge. In this challenge, Task 1 is a Text-Dependent (TD) mode where speaker verification systems are required to automatically determine whether a test segment with a specific phrase belongs to the target speaker. We improved the system pipeline in three aspects: data processing, front-end training and back-end processing. In addition, we explored training strategies such as spectrogram augmentation and transfer learning. The experimental results show that these attempts are effective and that our best single system, a transferred model with spectrogram augmentation and attentive statistics pooling, significantly outperforms the official baseline on both the progress subset and the evaluation subset. Finally, a fusion of seven subsystems was chosen as our primary system, which yielded minDCF values of 0.0856 and 0.0862 on the progress subset and evaluation subset, respectively.
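
For readers unfamiliar with the reported metric, the sketch below computes a minimum normalized detection cost (minDCF) from trial scores by sweeping the decision threshold. The cost parameters shown are common defaults and not necessarily the exact SdSVC operating point; the scores are hypothetical.

```python
import numpy as np

def min_dcf(target_scores, nontarget_scores, p_target=0.01, c_miss=1.0, c_fa=1.0):
    """Minimum normalized detection cost over all decision thresholds."""
    scores = np.concatenate([target_scores, nontarget_scores])
    labels = np.concatenate([np.ones(len(target_scores)), np.zeros(len(nontarget_scores))])
    order = np.argsort(scores)
    labels = labels[order]
    n_tar, n_non = labels.sum(), len(labels) - labels.sum()
    # Sweep the threshold across every score: miss and false-alarm rates at each cut.
    p_miss = np.concatenate([[0.0], np.cumsum(labels) / n_tar])
    p_fa = np.concatenate([[1.0], 1.0 - np.cumsum(1 - labels) / n_non])
    dcf = c_miss * p_target * p_miss + c_fa * (1 - p_target) * p_fa
    dcf_norm = dcf / min(c_miss * p_target, c_fa * (1 - p_target))
    return float(dcf_norm.min())

# Hypothetical verification scores for target and non-target trials.
rng = np.random.default_rng(0)
tar = rng.normal(2.0, 1.0, 1000)
non = rng.normal(-2.0, 1.0, 100000)
print(min_dcf(tar, non))
```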

#13 Robust Text-Dependent Speaker Verification via Character-Level Information Preservation for the SdSV Challenge 2020

Authors: Sung Hwan Mun ; Woo Hyun Kang ; Min Hyun Han ; Nam Soo Kim

This paper describes our submission to Task 1 of the Short-duration Speaker Verification (SdSV) challenge 2020. Task 1 is a text-dependent speaker verification task, where both the speaker and the phrase are required to be verified. The submitted systems were composed of TDNN-based and ResNet-based front-end architectures, in which the frame-level features were aggregated with various pooling methods (e.g., statistical, self-attentive, ghostVLAD pooling). Although the conventional pooling methods provide embeddings with a sufficient amount of speaker-dependent information, our experiments show that these embeddings often lack phrase-dependent information. To mitigate this problem, we propose new pooling and score compensation methods that leverage a CTC-based automatic speech recognition (ASR) model to take the lexical content into account. Both methods showed improvements over the conventional techniques, and the best performance was achieved by fusing all the experimented systems, which showed 0.0785 MinDCF and 2.23% EER on the challenge’s evaluation subset.

#14 The TalTech Systems for the Short-Duration Speaker Verification Challenge 2020

Authors: Tanel Alumäe ; Jörgen Valk

This paper presents the Tallinn University of Technology systems submitted to the Short-duration Speaker Verification Challenge 2020. The challenge consists of two tasks, focusing on text-dependent and text-independent speaker verification with some cross-lingual aspects. We used speaker embedding models that consist of squeeze-and-attention based residual layers, multi-head attention and either a cross-entropy-based or an additive angular margin based objective function. In order to encourage the model to produce language-independent embeddings, we trained the models in a multi-task manner, using dataset-specific output layers. In the text-dependent task we employed a phrase classifier to reject trials with non-matching phrases. In the text-independent task we used a language classifier to boost the scores of trials where the language of the test and enrollment utterances does not match. Our final primary metric score was 0.075 in Task 1 (ranked 6th) and 0.118 in Task 2 (ranked 8th).

#15 Investigation of NICT Submission for Short-Duration Speaker Verification Challenge 2020

Authors: Peng Shen ; Xugang Lu ; Hisashi Kawai

In this paper, we describe the NICT speaker verification system for the text-independent task of the short-duration speaker verification (SdSV) challenge 2020. We first present the details of the training data and feature preparation. Then, x-vector-based front-ends with different network configurations, as well as back-ends based on probabilistic linear discriminant analysis (PLDA), simplified PLDA, cosine similarity, and neural network-based PLDA, are investigated and explored. Finally, we apply a greedy fusion and calibration approach to select and combine the subsystems. To improve the performance of the speaker verification system on short-duration evaluation data, we introduce our investigations on how to reduce the duration mismatch between the training and test datasets. Experimental results showed that our primary fusion yielded a minDCF of 0.074 and an EER of 1.50 on the evaluation subset, which was the second-best result in the text-independent speaker verification task.

#16 Cross-Lingual Speaker Verification with Domain-Balanced Hard Prototype Mining and Language-Dependent Score Normalization

Authors: Jenthe Thienpondt ; Brecht Desplanques ; Kris Demuynck

In this paper we describe the top-scoring IDLab submission for the text-independent task of the Short-duration Speaker Verification (SdSV) Challenge 2020. The main difficulty of the challenge lies in the large degree of varying phonetic overlap between the potentially cross-lingual trials, along with the limited availability of in-domain DeepMine Farsi training data. We introduce domain-balanced hard prototype mining to fine-tune the state-of-the-art ECAPA-TDNN x-vector based speaker embedding extractor. The sample mining technique efficiently exploits speaker distances between the speaker prototypes of the popular AAM-softmax loss function to construct challenging training batches that are balanced at the domain level. To enhance the scoring of cross-lingual trials, we propose a language-dependent s-norm score normalization. The imposter cohort only contains data from the Farsi target domain, which simulates the enrollment data always being Farsi. If a Gaussian back-end language model detects that the test speaker embedding contains English, a cross-language compensation offset determined on the AAM-softmax speaker prototypes is subtracted from the maximum expected imposter mean score. A fusion of five systems with minor topological tweaks resulted in a final MinDCF of 0.065 and an EER of 1.45% on the SdSVC evaluation set.
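
The scoring back-end builds on symmetric score normalization (s-norm). The sketch below shows a plain s-norm of a single trial score, with an optional mean offset standing in very loosely for the paper's language-dependent cohort and cross-language compensation; it is an illustration under assumed inputs, not the IDLab implementation.

```python
import numpy as np

def s_norm(raw_score, enroll_cohort_scores, test_cohort_scores, offset=0.0):
    """Symmetric score normalization (s-norm) of a single trial score.

    enroll_cohort_scores / test_cohort_scores are the scores of the enrollment
    and test embeddings against an imposter cohort. In the language-dependent
    variant described above, the cohort would contain only Farsi (target-domain)
    data, and `offset` crudely stands in for the cross-language compensation of
    the imposter mean applied when the test side is detected as English.
    """
    e = np.asarray(enroll_cohort_scores, dtype=float)
    t = np.asarray(test_cohort_scores, dtype=float)
    mu_e, std_e = e.mean(), e.std()
    mu_t, std_t = t.mean() - offset, t.std()
    return 0.5 * ((raw_score - mu_e) / std_e + (raw_score - mu_t) / std_t)

# Hypothetical cohort scores for one trial.
rng = np.random.default_rng(0)
print(s_norm(1.2, rng.normal(-0.5, 0.3, 200), rng.normal(-0.4, 0.3, 200)))
```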

#17 BUT Text-Dependent Speaker Verification System for SdSV Challenge 2020

Authors: Alicia Lozano-Diez ; Anna Silnova ; Bhargav Pulugundla ; Johan Rohdin ; Karel Veselý ; Lukáš Burget ; Oldřich Plchot ; Ondřej Glembek ; Ondřej Novotný ; Pavel Matějka

In this paper, we present the winning BUT submission for the text-dependent task of the SdSV challenge 2020. Given the large amount of training data available in this challenge, we explore successful techniques from text-independent systems in the text-dependent scenario. In particular, we trained x-vector extractors on both in-domain and out-of-domain datasets and combined them with i-vectors trained on concatenated MFCCs and bottleneck features, which have proven effective in the text-dependent scenario. Moreover, we proposed the use of a phrase-dependent PLDA back-end for scoring and its combination with a simple phrase recognizer, which brings up to 63% relative improvement on our development set with respect to using a standard PLDA. Finally, we combined our different i-vector and x-vector based systems using a simple linear logistic regression score-level fusion, which provides a 28% relative improvement on the evaluation set with respect to our best single system.
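
The final fusion step mentioned here is a standard technique: a linear logistic regression learns weights for the subsystem scores on a development set. The sketch below shows the idea with scikit-learn on hypothetical scores; it is an illustration, not the BUT fusion setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Illustrative score-level fusion: each row holds the scores that the individual
# subsystems assign to one trial, and the label says whether it is a target trial.
dev_scores = np.array([[2.1, 1.7, 0.9],      # hypothetical development trials
                       [-1.0, -0.3, -0.8],
                       [1.5, 2.0, 1.1],
                       [-2.2, -1.9, -1.4]])
dev_labels = np.array([1, 0, 1, 0])

fusion = LogisticRegression()
fusion.fit(dev_scores, dev_labels)

# The fused score for an evaluation trial is the learned weighted sum (log-odds).
eval_scores = np.array([[1.8, 1.2, 0.7]])
print(fusion.decision_function(eval_scores))
```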

#18 Exploring the Use of an Unsupervised Autoregressive Model as a Shared Encoder for Text-Dependent Speaker Verification

Authors: Vijay Ravi ; Ruchao Fan ; Amber Afshan ; Huanhua Lu ; Abeer Alwan

In this paper, we propose a novel way of addressing text-dependent automatic speaker verification (TD-ASV) by using a shared encoder with task-specific decoders. An autoregressive predictive coding (APC) encoder is pre-trained in an unsupervised manner using both out-of-domain (LibriSpeech, VoxCeleb) and in-domain (DeepMine) unlabeled datasets to learn a generic, high-level feature representation that encapsulates speaker and phonetic content. Two task-specific decoders were trained using labeled datasets to classify speakers (SID) and phrases (PID). Speaker embeddings extracted from the SID decoder were scored using PLDA. The SID and PID systems were fused at the score level. There is a 51.9% relative improvement in minDCF for our system compared to the fully supervised x-vector baseline on the cross-lingual DeepMine dataset. However, the i-vector/HMM method outperformed the proposed APC encoder-decoder system. A fusion of the x-vector/PLDA baseline and the SID/PLDA scores prior to PID fusion further improved performance by 15%, indicating complementarity of the proposed approach to the x-vector system. We show that the proposed approach can leverage large, unlabeled, data-rich domains and learn speech patterns independent of downstream tasks. Such a system can provide competitive performance in domain-mismatched scenarios where the test data come from data-scarce domains.

#19 Smart Tube: A Biofeedback System for Vocal Training and Therapy Through Tube Phonation

Authors: Naoko Kawamura ; Tatsuya Kitamura ; Kenta Hamada

Tube phonation, or straw phonation, is a frequently used vocal training technique to improve the efficiency of the vocal mechanism by repeatedly producing a speech sound into a tube or straw. Use of the straw results in a semi-occluded vocal tract in order to maximize the interaction between the vocal fold vibration and the vocal tract. This method requires a voice trainer or therapist to raise the trainee or patient’s awareness of the vibrations around his or her mouth, guiding him/her to maximize the vibrations, which results in efficient phonation. A major problem with this process is that the trainer cannot monitor the trainee/patient’s vibratory state in a quantitative manner. This study proposes the use of Smart Tube, a straw with an attached acceleration sensor and LED strip that can measure vibrations and provide corresponding feedback through LED lights in real-time. The biofeedback system was implemented using a microcontroller board, Arduino Uno, to minimize cost. Possible system function enhancements include Bluetooth compatibility with personal computers and/or smartphones. Smart Tube can facilitate improved phonation for trainees/patients by providing quantitative visual feedback.

#20 VCTUBE: A Library for Automatic Speech Data Annotation

Authors: Seong Choi ; Seunghoon Jeong ; Jeewoo Yoon ; Migyeong Yang ; Minsam Ko ; Eunil Park ; Jinyoung Han ; Munyoung Lee ; Seonghee Lee

We introduce an open-source Python library, VCTUBE, which can automatically generate <audio, text> pairs of speech data from a given YouTube URL. We believe VCTUBE is useful for easily collecting, processing, and annotating speech data toward developing speech synthesis systems.

#21 A Mandarin L2 Learning APP with Mispronunciation Detection and Feedback

Authors: Yanlu Xie ; Xiaoli Feng ; Boxue Li ; Jinsong Zhang ; Yujia Jin

In this paper, an APP with mispronunciation detection and feedback for Mandarin L2 learners is presented. The APP can detect mispronunciations in words and highlight them in red at the phone level. A score is also shown to evaluate the overall pronunciation. When the highlighted phone is touched, the learner’s pronunciation and the standard pronunciation are played. Then a Flash animation that describes the movement of the tongue, mouth, and other articulators is shown to the learner. The learner can repeat the process to improve and exercise the pronunciation. The APP, called ‘SAIT Hànyǔ’, can be downloaded from the App Store.

#22 Rapid Enhancement of NLP Systems by Acquisition of Data in Correlated Domains

Authors: Tejas Udayakumar ; Kinnera Saranu ; Mayuresh Sanjay Oak ; Ajit Ashok Saunshikar ; Sandip Shriram Bapat

In a generation where industries are going through a paradigm shift because of the rapid growth of deep learning, structured data plays a crucial role in the automation of various tasks. Textual structured data is one such kind, used extensively in systems like chat bots and automatic speech recognition. Unfortunately, the majority of available textual data is unstructured, in the form of user reviews and feedback, social media posts, etc. Automating the task of categorizing or clustering these data into meaningful domains will reduce the time and effort needed to build sophisticated human-interactive systems. In this paper, we present a web tool that builds domain-specific data based on a search phrase from a database of highly unstructured user utterances. We also show the use of an Elasticsearch database with custom indexes for full correlated text search. This tool uses the open-source GloVe model combined with cosine similarity and performs a graph-based search to provide semantically and syntactically meaningful corpora. Finally, we discuss its applications with respect to natural language processing.

#23 Computer-Assisted Language Learning System: Automatic Speech Evaluation for Children Learning Malay and Tamil

Authors: Ke Shi ; Kye Min Tan ; Richeng Duan ; Siti Umairah Md. Salleh ; Nur Farah Ain Suhaimi ; Rajan Vellu ; Ngoc Thuy Huong Helen Thai ; Nancy F. Chen

We present a computer-assisted language learning system that automatically evaluates the pronunciation and fluency of spoken Malay and Tamil. Our system consists of a server and a user-facing Android application, where the server is responsible for speech-to-text alignment as well as pronunciation and fluency scoring. We describe our system architecture and discuss the technical challenges associated with low resource languages. To the best of our knowledge, this work is the first pronunciation and fluency scoring system for Malay and Tamil.

#24 Real-Time, Full-Band, Online DNN-Based Voice Conversion System Using a Single CPU

Authors: Takaaki Saeki ; Yuki Saito ; Shinnosuke Takamichi ; Hiroshi Saruwatari

We present a real-time, full-band, online voice conversion (VC) system that uses a single CPU. For practical applications, VC must be high quality and able to perform real-time, online conversion with limited computational resources. Our system achieves this by combining non-linear conversion with a deep neural network and short-tap, sub-band filtering. We evaluate our system and demonstrate that it 1) has an estimated complexity of around 2.5 GFLOPS and a measured real-time factor (RTF) of around 0.5 with a single CPU, and 2) attains converted speech with a naturalness mean opinion score (MOS) of 3.4 out of 5.0.
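
The real-time factor (RTF) quoted above is simply processing time divided by audio duration. The sketch below measures it for an arbitrary conversion function; the dummy converter and sample rate are placeholders, not the authors' system.

```python
import time

def real_time_factor(convert_fn, waveform, sample_rate):
    """Measure the real-time factor (RTF) of a conversion function.

    RTF = processing time / audio duration; values below 1.0 mean the system
    runs faster than real time. `convert_fn` is any function mapping a
    waveform to a converted waveform (a hypothetical stand-in here).
    """
    start = time.perf_counter()
    convert_fn(waveform)
    elapsed = time.perf_counter() - start
    duration = len(waveform) / sample_rate
    return elapsed / duration

# Hypothetical: a dummy "converter" and one second of 48 kHz full-band audio.
dummy_audio = [0.0] * 48000
print(real_time_factor(lambda x: x, dummy_audio, sample_rate=48000))
```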

#25 A Dynamic 3D Pronunciation Teaching Model Based on Pronunciation Attributes and Anatomy

Authors: Xiaoli Feng ; Yanlu Xie ; Yayue Deng ; Boxue Li

In this paper, a dynamic three-dimensional (3D) head model is introduced, built on knowledge of human anatomy and the theory of distinctive features. The model is used to help Chinese learners intuitively understand the exact location and method of phoneme articulation. Learners can access the phonetic learning system, choose the target sound they want to learn, and then watch the 3D dynamic animations of the phonemes. They can view the lips, tongue, soft palate, uvula, and other active articulators, as well as the teeth, gums, hard palate, and other passive articulators, from different angles. In this process, they can make the skin and some of the muscles semi-transparent, or zoom the model in or out, to see the dynamic changes of the articulators clearly. By looking at the 3D model, learners can find the exact place of articulation of each sound and imitate the pronunciation actions.